MATHEMATICAL PROGRAMMING METHODS
OF  PATTERN  CLASSIFICATION 

By 

Richard  C.  Grinold 

Division  of  Engineering  and  Applied  Physics 
Harvard  University  Cambridge,  Massachusetts 


ABSTRACT 

This  paper  studies  four  mathematical  programming  methods 
which  are  useful  in  pattern  classification.  Two  of  the  models  are  for 
linearly  separable  problems,  while  the  others  work  without  separability. 




INTRODUCTION 


This report is designed to supplement "On Pattern Classification -
Introduction and Survey" [7]*, by describing several mathematical
programming approaches to the classification problem. We'll assume that
the reader is familiar with the Ho and Agrawala paper (at least sections
I, II, and IV) and draw on the motivation, notation, and definitions used
there.


Four  mathematical  programming  models  are  described  in  detail, 
and  two  more  are  mentioned  briefly.  Others  exist,  and  are  referenced 
in  the  publications  cited  here.  The  four  models  were  selected  for  their 
computational  and  conceptual  properties. 

Before  describing  the  contents  of  the  paper,  we'll  expand  upon  and 
change  some  of  the  notation  adopted  in  [7]. 

Definitions:

(i). Instead of $x^1(i)^T$ and $x^0(j)^T$, the training samples from
classes one and zero will be denoted by m component
row vectors $A_i^1$ and $A_j^0$.

(ii). For k = 0, 1; $A^k$ is the $n_k \times m$ matrix whose rows are $A_i^k$,
i = 1, 2, ..., $n_k$.

(iii). h, $\ell$, e, g, and f are vectors of ones. Their dimensions
are given below.

h ~ $n_1 \times 1$; $\ell$ ~ $n_0 \times 1$; e ~ $m \times 1$; g ~ $n_1 n_0 \times 1$; and
f ~ $n \times 1$, where $n = n_1 + n_0$.


* Y. C. Ho and A. K. Agrawala, Technical Report No. 557, Division of
Engineering and Applied Physics, Harvard University, March 1968; also
published in IEEE Trans. on Auto. Cont., Vol. 13, No. 6, December 1968,
and Proceedings of the IEEE, Vol. 56, No. 12, December 1968.
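(iv). A linear decision function is an (m+1) x 1 vector $w = (\lambda, u)'$,
where $\lambda$ is a scalar and u is an m x 1 vector.
Suppose the patterns are described by an m component row vector x,
then the decision function defined by w is

$$f(x) = \lambda + xu = (1, x)w.$$

Using these definitions we note that the n x (m+1) matrix A, whose
rows are the augmented samples with the class zero rows negated, can
be written

$$A = \begin{pmatrix} h & A^1 \\ -\ell & -A^0 \end{pmatrix}.$$

(v). A linear decision function w is a separator if

$$Aw > 0.$$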


Problems  are  specified  by  their  range  of  attributes. 

The  ranges  are  defined  below. 

(vi). $S^1 = \{x \mid P(x \mid H^1) > 0\}$

$S^0 = \{x \mid P(x \mid H^0) > 0\}$

(vii). The operator C will denote convex closure. Thus $C(S^k)$
is the closed convex hull of $S^k$.

(viii). The problem is separable if $C(S^0)$ and $C(S^1)$ are disjoint.
If they intersect the problem is nonseparable.

(ix). The problem is decidable if $S^0$ and $S^1$ are disjoint.

(x). We shall define $C(A^k)$ as the convex hull of the rows of
$A^k$, k = 0, 1.

Thus

$$C(A^0) = \{b \mid b = zA^0,\; z\ell = 1,\; z \ge 0\}$$

$$C(A^1) = \{b \mid b = yA^1,\; yh = 1,\; y \ge 0\}$$

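Definitions (iv) and (v) can be made concrete with a small sketch. It assumes the matrix A stacks the rows (1, x) for class one and -(1, x) for class zero; the function names and the sample points are hypothetical and serve only as an illustration.

```python
def stack_samples(A1, A0):
    """Build A: rows (1, x) for class one and -(1, x) for class zero."""
    rows = [[1.0] + list(x) for x in A1]
    rows += [[-1.0] + [-v for v in x] for x in A0]
    return rows


def is_separator(A, w):
    """Definition (v): w is a separator if every component of Aw is positive."""
    return all(sum(a * b for a, b in zip(row, w)) > 0 for row in A)


# Hypothetical training samples with m = 2 features.
A1 = [[2.0, 1.0], [3.0, 2.0]]    # class one
A0 = [[0.0, 0.0], [-1.0, 1.0]]   # class zero
A = stack_samples(A1, A0)

# w = (lambda, u) = (-1, (1, 0)), i.e. f(x) = -1 + x_1.
print(is_separator(A, [-1.0, 1.0, 0.0]))
```

Negating the class zero rows is what lets a single inequality, Aw > 0, express correct classification of both classes.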


Five sections and an appendix follow. Sections one and two
describe models used in the separable and nonseparable cases. The
third section remarks on the models' flexibility in terms of accommodating
new data and use in judging new features. Section four considers the
models' generalization properties, while the last section describes an
application. The appendix is a brief introduction to linear and quadratic
programming.

Of the four models, two have been described in the published
literature. One of the unpublished models is due to Canon and Cullum [3];
the other is the author's responsibility [6].
I. THE SEPARABLE CASE

Charnes [4] and Mangasarian [9] independently proposed a linear
programming model for separating disjoint polyhedrons. Their distinct
approaches illustrate the duality principle of linear programming. Charnes
asks if the sets $C(A^0)$ and $C(A^1)$ are disjoint, while Mangasarian looks
directly for a separating hyperplane.

By definition, the $C(A^k)$ will intersect if and only if the system (1)
has a feasible solution.

$$zA^0 - yA^1 = 0, \quad yh = 1, \quad z\ell = 1, \quad y \ge 0, \quad z \ge 0 \tag{1}$$

We can discover a solution of (1) by adding artificial variables to the system
and minimizing the infeasibility. This gives us a linear program.

Minimize $(r + s)e$      (2)

Subj. to

  $zA^0 - yA^1 + rI - sI = 0$
  $z\ell = 1$
  $yh = 1$
  $z \ge 0,\; y \ge 0,\; r \ge 0,\; s \ge 0$

This problem has m + 2 equality constraints with n + 2m nonnegative
variables. The value, (r + s)e, is nonnegative since r and s are
nonnegative. Finally, we can easily construct a first basic feasible solution
of (2).
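One way to construct such a first basic feasible solution is sketched below. This is an illustrative choice, not necessarily the one the author had in mind: put unit weight on the first sample of each class and let the artificial variables r and s absorb the resulting difference. The function names are hypothetical.

```python
def initial_bfs(A1, A0):
    """Return (z, y, r, s) feasible for (2), built from the first sample of each class."""
    n1, n0, m = len(A1), len(A0), len(A1[0])
    z = [1.0] + [0.0] * (n0 - 1)                   # z l = 1, z >= 0
    y = [1.0] + [0.0] * (n1 - 1)                   # y h = 1, y >= 0
    d = [A0[0][k] - A1[0][k] for k in range(m)]    # d = zA^0 - yA^1
    r = [max(-dk, 0.0) for dk in d]                # r - s = -d with r, s >= 0
    s = [max(dk, 0.0) for dk in d]
    return z, y, r, s


def residual(A1, A0, z, y, r, s):
    """Left-hand side zA^0 - yA^1 + r - s of the first constraint block of (2)."""
    out = []
    for k in range(len(A1[0])):
        zA0 = sum(z[j] * A0[j][k] for j in range(len(A0)))
        yA1 = sum(y[i] * A1[i][k] for i in range(len(A1)))
        out.append(zA0 - yA1 + r[k] - s[k])
    return out


A1 = [[2.0, 1.0], [3.0, 2.0]]    # hypothetical class-one samples
A0 = [[0.0, 0.0], [-1.0, 1.0]]   # hypothetical class-zero samples
z, y, r, s = initial_bfs(A1, A0)
print(residual(A1, A0, z, y, r, s), sum(r) + sum(s))
```

At most one of $r_k$, $s_k$ is nonzero for each feature k, so together with the two unit weights the solution uses at most m + 2 positive variables, matching the number of equality constraints.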

The  alternate  approach  involves  the  decision  function  directly. 

Suppose the m vector u and scalars $(\gamma, \delta)$ satisfy the following conditions:

  $\delta - \gamma > 0$
  $-A^1 u - h\gamma \le 0$      (3)
  $A^0 u + \ell\delta \le 0$

then $w = (\frac{\gamma + \delta}{2}, u)$ is a separator. A solution of (3) can be discovered
by solving (4).

Maximize $\delta - \gamma$      (4)

Subj. to

  $-A^1 u - h\gamma \le 0$
  $+A^0 u + \ell\delta \le 0$
  $-e \le u \le e$

Problem (4) has n inequality constraints, two free variables $(\gamma, \delta)$, and
m variables with upper and lower bounds. The bounds rule out infinite
solutions. Evidently, $(\gamma, \delta, u) = (0, 0, 0)$ is a feasible solution of (4).

Appealing to the results in the appendix we can state that problem
(4) is the dual of problem (2), and the duality theorem applies. This
guarantees the existence of optimal solutions $(\bar z, \bar y, \bar r, \bar s)$ and
$(\bar\gamma, \bar\delta, \bar u)$ such that:

$$(\bar r + \bar s)e = \bar\delta - \bar\gamma \ge 0.$$

There are two possibilities. If $\bar\delta - \bar\gamma > 0$, then $(\frac{\bar\gamma + \bar\delta}{2}, \bar u)$ is a
separator. If $(\bar r + \bar s)e = 0$, then $(\bar z, \bar y)$ solves (1), and the convex hulls
intersect. These facts are summarized below.

Theorem:      (5)

(i). Problems (2) and (4) have optimal solutions with equal,
nonnegative values.

(ii). If the optimal value is zero, the patterns are not
linearly separable.

(iii). If the optimal value is positive, then $w = (\frac{\bar\gamma + \bar\delta}{2}, \bar u)$
defines a separator which maximizes

  $\min [A_i w \mid i = 1, 2, \ldots, n]$

Subj. to

  $-1 \le w_j \le 1$, for j = 1, 2, ..., m

Statements (i), (ii), and the first part of (iii) are established above. The
final statement can be established by contradiction.
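The role of $\gamma$ and $\delta$ in (3) and (4) can be sketched for a fixed direction u. The code assumes the constraints read $-A^1 u - h\gamma \le 0$ and $A^0 u + \ell\delta \le 0$, so that the best scalars for a given u are $\gamma = -\min_i A_i^1 u$ and $\delta = -\max_j A_j^0 u$. The function name and data are hypothetical, and no linear program is solved here.

```python
def separator_from_direction(A1, A0, u):
    """For a fixed u, return the tightest (gamma, delta) and w = ((gamma+delta)/2, u)."""
    def proj(x):
        return sum(a * b for a, b in zip(x, u))

    gamma = -min(proj(x) for x in A1)   # smallest gamma with -A^1 u - h*gamma <= 0
    delta = -max(proj(x) for x in A0)   # largest delta with A^0 u + l*delta <= 0
    w = [(gamma + delta) / 2.0] + list(u)
    return gamma, delta, w


A1 = [[2.0, 1.0], [3.0, 2.0]]    # hypothetical class-one samples
A0 = [[0.0, 0.0], [-1.0, 1.0]]   # hypothetical class-zero samples
gamma, delta, w = separator_from_direction(A1, A0, [1.0, 0.0])
print(delta - gamma, w)
```

The gap $\delta - \gamma$ is the distance, measured along u, between the two projected classes; the threshold $(\gamma + \delta)/2$ places the decision surface midway, which is why the quality of w equals half the optimal value of (4).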

The linear programs will be solved using some variant of Dantzig's [2]
simplex method. This is a rapidly convergent combinatorial procedure,
while the adaption algorithms (see [7], Table I) are gradient descent
techniques which converge slowly. If the patterns are not separable, slow
convergence and nonconvergence can be confused. See [9], pg. 451, for a
more detailed comment along this line. The adaption algorithms do have
the advantage of simplicity, but this is largely offset by the wide
availability of professionally written linear programming codes. Either (2)
or (4) can be solved, but the simplex algorithm is more efficient with
fewer nontrivial constraints. It is not very sensitive to the number of
variables. Since $m + 2 \ll n$ it is reasonable to solve (2).

Canon  and  Cullum  [3]  have  proposed  a  quadratic  programming  method 
for  the  separable  case.  Although  it  is  generally  more  difficult  to  solve 
quadratic  programs,  the  authors  take  advantage  of  the  problem's  special 
structure  and  claim  their  method  is  competitive  with  the  linear  programming 
model. 


For each i and j, i = 1, 2, ..., $n_1$, j = 1, 2, ..., $n_0$; we can define
a difference vector

$$D_k = (A_i^1 - A_j^0) \quad \text{for } k = 1, 2, \ldots, n_1 n_0.$$

The vectors $D_k$ are the rows of the $n_1 n_0 \times m$ matrix D. Recall
$C(D) = \{u \mid u = yD,\; yg = 1,\; y \ge 0\}$. It is easy to establish that $C(A^0)$ and
$C(A^1)$ will be separable if and only if the origin is not contained in C(D).

This suggests a test for separability: find the vector in C(D) with
minimum norm. The problem can be written in two ways:

Minimize $\frac{uIu'}{2}$      (6)

Subj. to

  $yD - uI = 0$
  $yg = 1$
  $y \ge 0$

Minimize $\frac{yDD'y'}{2}$      (7)

Subj. to

  $yg = 1$
  $y \ge 0$

The following facts about (6) and (7) should be clear: they are equivalent,
the objectives are convex and quadratic, they have optimal solutions with
nonnegative values, and the sets are separable if and only if the optimal
value is positive.
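The equivalence of the two formulations can be spelled out in one line. The sketch below assumes the objective of (6) is $\tfrac{1}{2}uIu'$, as in the appendix, and eliminates u through the constraint $u = yD$:

```latex
\begin{aligned}
\tfrac{1}{2}\,uIu' &= \tfrac{1}{2}\,(yD)(yD)' = \tfrac{1}{2}\,y\,DD'\,y', \\
y\,DD'\,y' &= \lVert yD \rVert^2 \ge 0 \quad\text{for all } y
  \;\Longrightarrow\; DD' \text{ is positive semidefinite, so the objective is convex}, \\
\min\Bigl\{\tfrac{1}{2}\,y\,DD'\,y' \;\Big|\; yg = 1,\ y \ge 0\Bigr\}
  &= \tfrac{1}{2}\,\min_{u \in C(D)} \lVert u \rVert^2 > 0
  \iff 0 \notin C(D).
\end{aligned}
```

The minimum is attained because the feasible set for y is compact.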

It is well known that any point in C(D) can be expressed as a convex
combination of at most m + 1 rows of D. This fact is used to reduce the
problem's size. The algorithm solves a modified version of (6), restricting
attention to a subset of m + 1 rows. A test determines if the restricted
solution is optimal with all rows considered. If so, (6) is solved. If
not, a new row is added, an old row dropped, and the algorithm proceeds,
finding an optimal solution in a finite number of steps. The optimal
solution of (6) defines the linear decision surface.

Suppose $(\bar y, \bar u)$ solves (6), $\bar u \ne 0$, and

  $\gamma = \min \{A_i^1 \bar u' \mid i = 1, 2, \ldots, n_1\}$

  $\delta = \max \{A_j^0 \bar u' \mid j = 1, 2, \ldots, n_0\}$

then $(-\frac{\gamma + \delta}{2}, \bar u')$ is a separator. If $\bar u = 0$, no separator exists. This is
demonstrated in the appendix using the Kuhn-Tucker theorem. Canon
and Cullum do the same by showing problem (6) is equivalent to:

Maximize $\{\min [uz \mid u \in C(D)]\}$      (8)

Subj. to

  $z'Iz \le 1$
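The minimum-norm test can be tried numerically with a few lines of code. The sketch below does not implement the Canon-Cullum procedure; it substitutes Gilbert's iteration (a Frank-Wolfe type method with exact line search), which only approximates the minimum-norm point of C(D). All function names and the sample data are hypothetical.

```python
def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def difference_rows(A1, A0):
    """Rows D_k = A^1_i - A^0_j for every pair (i, j)."""
    return [[a - b for a, b in zip(x1, x0)] for x1 in A1 for x0 in A0]


def min_norm_point(D, iters=1000, tol=1e-12):
    """Approximate the minimum-norm point of C(D) by Gilbert's iteration."""
    u = list(D[0])
    for _ in range(iters):
        d = min(D, key=lambda row: dot(row, u))   # row of D with the smallest projection on u
        gap = dot(u, u) - dot(d, u)               # zero exactly when u is optimal
        if gap <= tol:
            break
        diff = [a - b for a, b in zip(u, d)]
        t = min(1.0, gap / dot(diff, diff))       # exact line search on the segment [u, d]
        u = [a - t * b for a, b in zip(u, diff)]
    return u


def separator_from_min_norm(A1, A0, u):
    """w = (-(gamma + delta)/2, u) with gamma, delta as defined in the text."""
    gamma = min(dot(x, u) for x in A1)
    delta = max(dot(x, u) for x in A0)
    return [-(gamma + delta) / 2.0] + list(u)


A1 = [[2.0, 1.0], [3.0, 2.0]]    # hypothetical class-one samples
A0 = [[0.0, 0.0], [-1.0, 1.0]]   # hypothetical class-zero samples
u = min_norm_point(difference_rows(A1, A0))
w = separator_from_min_norm(A1, A0, u)
print(u, w)
```

A returned u with norm near zero indicates that the origin lies in C(D), so no separator exists. Because the iteration is approximate, a small positive norm should be judged against a tolerance chosen for the data at hand.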

II. NONSEPARABLE

One approach to the nonseparable case was taken in [6]. A description
will require two definitions.

(xi). Let $a = \frac{1}{n}\sum_{i=1}^{n} A_i$ be the average of the rows of A.

(xii). For any decision function w, let the quality of w be defined
as

  $\min [A_i w \mid i = 1, 2, \ldots, n]$

If w is a separator, the quality is positive. If w is not a separator,
the negative of the quality (a nonnegative number) measures the largest
error the decision surface makes. To obtain a decision surface of highest
quality we solve

Maximize $\{\min [A_i w \mid i = 1, 2, \ldots, n]\}$

Subj. to $aw = 1$

The constraint is a normalization.

This problem can be transformed into a linear program by introducing
a new variable $\beta$ and requiring $\beta \le A_i w$ for i = 1, 2, ..., n. The
new problem and its dual are given below.

Maximize $\beta$      (9)

Subj. to

  $Aw - f\beta \ge 0$
  $aw = 1$

Minimize $\gamma$      (10)

Subj. to

  $yA - \gamma a = 0$
  $yf = 1$
  $y \ge 0$

Problem (10) has m + 2 equality constraints, n nonnegative variables,
and one free variable.

The main result of [6] is:

Theorem (11)

(i). Problems (9) and (10) have optimal solutions with equal
objective values if and only if $a \ne 0$. When a = 0, (9) is infeasible.

(ii). If $(\bar w, \bar\beta)$ solves (9) and $\bar\beta > 0$, then $\bar w$ defines a separator
of maximum quality.

(iii). If $(\bar w, \bar\beta)$ solves (9) and $\bar\beta \le 0$, then the patterns are not
separable and $\bar w$ defines a decision surface that minimizes
the maximum error.

This is equivalent to (5) in the separable case. In addition, a meaningful
decision surface is generated if the patterns are not separable.
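The quality measure and the normalization aw = 1 are easy to compute for a given w. The sketch below only evaluates a candidate decision function and does not solve (9); the names and data are hypothetical, and it assumes A has rows (1, x) for class one and -(1, x) for class zero.

```python
def row_average(A):
    """a: the average of the rows of A."""
    n = len(A)
    return [sum(col) / n for col in zip(*A)]


def quality(A, w):
    """min_i A_i w: positive for a separator, otherwise minus the largest error."""
    return min(sum(a * b for a, b in zip(row, w)) for row in A)


def normalize(A, w):
    """Scale w by a positive factor so that a w = 1 (requires a w > 0)."""
    aw = sum(a * b for a, b in zip(row_average(A), w))
    if aw <= 0:
        raise ValueError("a w must be positive to normalize without changing signs")
    return [v / aw for v in w]


# Hypothetical data: two class-one rows (1, x) and two class-zero rows -(1, x).
A = [[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [-1.0, 0.0, 0.0], [-1.0, 1.0, -1.0]]
w = normalize(A, [-1.0, 1.0, 0.0])
print(quality(A, w))
```

Without the normalization the quality of any separator could be made arbitrarily large by scaling w, so the maximization would be unbounded.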

Smith [13] has another approach. Note that $Aw > 0$ has a solution
iff $Aw \ge f$ has a solution. In this spirit, we can solve

Minimize $f'v$

Subj. to

  $Aw + Iv \ge f$
  $v \ge 0$

The $v_i$'s measure the size of any error in the classification of the ith
sample. Thus if $A_i w \ge 1$, there is no error and $v_i = 0$. If $A_i w < 1$, $v_i$ is
positive. There is some difficulty if $0 < A_i w < 1$. In this case the pattern
is correctly classified, but an error is counted. This behavior is observed
in optimal solutions.

The dual, (12), is a linear program with m + 1 equality constraints
and n nonnegative variables with upper bounds. It is relatively easy to
solve [5].

Maximize $yf$      (12)

Subj. to

  $yA = 0$
  $0 \le y \le f'$

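For a fixed w the error variables of Smith's model are determined in closed form: the smallest feasible $v_i$ is $\max(0, 1 - A_i w)$. The sketch below evaluates them for hypothetical data and shows the difficulty noted above, where correctly classified samples with $0 < A_i w < 1$ still contribute to the objective.

```python
def slacks(A, w):
    """Smallest v >= 0 with Aw + v >= f for a fixed w: v_i = max(0, 1 - A_i w)."""
    return [max(0.0, 1.0 - sum(a * b for a, b in zip(row, w))) for row in A]


# Hypothetical data: rows (1, x) for class one and -(1, x) for class zero.
A = [[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [-1.0, 0.0, 0.0], [-1.0, 1.0, -1.0]]
w = [-0.5, 0.5, 0.0]              # a separator: every A_i w is positive

v = slacks(A, w)
print(v, sum(v))                  # objective f'v for this w
```

Here every sample is on the correct side of the decision surface, yet two of them carry a positive $v_i$ because they fall inside the margin $A_i w < 1$.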
This model suggests several conceptually interesting but computationally
difficult variations. For instance, we could minimize the sum of
squared errors. This leads to a quadratic program:

Minimize $v'Iv$

Subj. to

  $Aw + Iv \ge f$
  $v \ge 0$

Another variant maximizes the number of correctly classified samples:

Maximize $\sum_{i=1}^{n} \delta(A_i w)$

Subj. to $-1 \le w_i \le 1$, i = 0, 1, 2, ..., m

$\delta(\cdot)$ is the step function; one if its argument is positive, zero otherwise.
This problem can be reformulated as an integer program, [12] pp. 194-8.

Another method of treating the nonseparable case was proposed by
Mangasarian [10]. The approach is similar to Arkadev and Braverman [1],
i.e. a piecewise linear decision surface is created which decides correctly
about all the data. Mangasarian uses mathematical programming to construct
the decision surface. We will not examine that algorithm in detail,
but we do comment on its generalization properties in section 4.

III. FLEXIBILITY

This section examines the ability of the different models, (2), (6), (10),
and (12), to handle new data and yield information useful in selecting new
features. Models (2), (10), and (12) can accept new data points and find
a new decision surface easily. In each case, adding a new point is
equivalent to introducing a new activity (column) into the linear program.

Model (6) has a similar property. For example, suppose a new
point in class one, $A^1_{n_1+1}$, is observed. This adds $n_0$ new rows to the
matrix D. If

$$A^1_{n_1+1}\,\bar u' \ge \min [A_i^1 \bar u' \mid i = 1, 2, \ldots, n_1]$$

no change is needed; the old decision surface is still optimal. If the
inequality does not hold, we continue to apply the Canon-Cullum algorithm
until a new optimal solution is obtained.
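The test for a new class-one point is a single comparison. In the sketch below u stands for the current optimal direction of model (6); the function name and data are hypothetical.

```python
def needs_reoptimization(A1, u, new_point):
    """True if the new class-one point projects below every existing class-one point."""
    def proj(x):
        return sum(a * b for a, b in zip(x, u))

    return proj(new_point) < min(proj(x) for x in A1)


A1 = [[2.0, 1.0], [3.0, 2.0]]   # hypothetical class-one samples
u = [2.0, 1.0]                  # hypothetical current optimal direction

print(needs_reoptimization(A1, u, [4.0, 4.0]))   # deep inside class one
print(needs_reoptimization(A1, u, [1.0, 1.0]))   # closer to class zero than any old point
```

The comparison costs one inner product per new point, so well-separated observations can be absorbed without touching the quadratic program.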

Introducing a new feature in (2), (10), or (12)* adds a new constraint
(row) to the linear program. If several new features are being considered,
we can devise a heuristic rule for choosing among them. Try the current
optimal solution for each new constraint. If the optimal solution satisfies
all the new constraints, it is still optimal. Otherwise, select the
constraint which is the furthest from being satisfied. This selects the
feature which maximizes the rate of improvement of the solution. Then a
new optimal solution can be obtained using the dual simplex method.


* There doesn't seem to be any way that a new feature can be accommodated
by model (6).

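For model (2) the heuristic can be sketched as follows. The reading assumed here is that a candidate feature contributes one new equality row, $z c^0 - y c^1 + r - s = 0$, where $c^0$ and $c^1$ are the feature's values on the class-zero and class-one samples, and that "furthest from being satisfied" means the largest absolute residual at the current optimum with the new artificial variables at zero. The names and numbers are hypothetical.

```python
def constraint_violation(z, y, col0, col1):
    """|z c^0 - y c^1|: residual of the candidate row of (2) at the current (z, y)."""
    zc = sum(a * b for a, b in zip(z, col0))
    yc = sum(a * b for a, b in zip(y, col1))
    return abs(zc - yc)


def pick_feature(z, y, candidates):
    """candidates maps a feature name to (class-zero column, class-one column)."""
    return max(candidates, key=lambda k: constraint_violation(z, y, *candidates[k]))


z = [1.0, 0.0]    # hypothetical optimal weights on the class-zero samples
y = [0.5, 0.5]    # hypothetical optimal weights on the class-one samples
candidates = {
    "a": ([1.0, 2.0], [1.0, 1.0]),   # residual 0: current solution stays optimal
    "b": ([3.0, 0.0], [0.0, 1.0]),   # residual 2.5
}
print(pick_feature(z, y, candidates))
```

A candidate with zero residual leaves the current solution optimal, so it cannot help the classifier; only candidates with a positive residual are worth passing to the dual simplex method.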
IV. GENERALIZATION

The generalization properties of the models are examined in this
section. In particular, we are interested in the decision surfaces
generated as the number of sample points n becomes large. For each n the
models produce a decision surface defined by a nonzero m + 1 vector.
Without loss of generality we can uniformly bound these vectors. Thus,
there will be subsequences which converge. We shall study the properties
of the limiting decision surface.

For example, assume $C(S^1)$ and $C(S^0)$ are disjoint with one set
compact, and consider model (2). Let $(\lambda^n, u^n)$ be the normalized optimal
decision surface for the n sample problem and let $(\lambda, u)$ be a limiting
surface: i.e. $(\lambda^n, u^n) \to (\lambda, u)$ on some subsequence. The following theorem
asserts $(\lambda, u)$ is optimal for the limiting problem.

Theorem:

With probability one (wp. 1) there exists a $\beta > 0$ such that $(\lambda, u, \beta)$
solve:

Maximize $\beta$

Subj. to

  $\lambda + xu - \beta \ge 0, \quad x \in C(S^1)$

  $-\lambda - xu - \beta \ge 0, \quad x \in C(S^0)$

  $-e \le u \le e$


Proof:

There exists a hyperplane which strictly separates $C(S^1)$ and $C(S^0)$.
Therefore the problem has an optimal solution with positive value.
Suppose $(\lambda, u)$ and some $\beta > 0$ are not optimal. A contradiction
can be established by appealing to the facts that $(\lambda, u)$ is (2) feasible
for all n, and that $(\lambda, u)$ is the limit of a subsequence of optimal solutions.

Three comments are in order. First, it is obvious that similar
results hold for models (6), (10), and (12). Secondly, if compactness
is dropped a weaker, $\beta \ge 0$, statement is true. Finally, if separability
doesn't hold, then (wp. 1) all models will indicate this for some large
value of n.

Assuming decidability we could obtain a like result using the
piecewise approach [10]. Additional regularity assumptions are needed
to allow a piecewise linear function defined by a finite number of
hyperplanes. Without decidability, the piecewise approach would struggle in
vain to produce a perfect decision function.

Model (10) will work in the separable case, but it has questionable
generalization properties. It is very sensitive to the tails of the
distributions. The decision surface minimizes the maximum error, therefore it
will react to the worst points or perhaps to a faulty observation. Things
can get worse.

Let $a^k$ be the finite means of the distributions $P(x \mid H^k)$ for k = 0, 1.
Then the row average of A will converge (wp. 1) to

$$a = \left( P(H^1) - P(H^0),\; P(H^1)a^1 - P(H^0)a^0 \right).$$

The following is an example of what can go wrong. Suppose $P(H^1) > P(H^0)$,
and the sets described below have a nonvoid intersection:

  $L = \{d \mid d = \gamma a,\; \gamma \ge 0\}$
  $Z = \{d \mid d = (1, b),\; b \in C(S^0)\}$

then the limiting optimal solution is given by $w = \left(\frac{1}{P(H^1) - P(H^0)}, 0\right)$, i.e.
the decision function is

$$f(x) = \frac{1}{P(H^1) - P(H^0)} > 0 \quad \text{for all } x.$$

The fact that f is correct more often than not offers little consolation. Note
that this phenomenon will occur if $S^0 = S^1 = R^m$ and $P(H^0) \ne P(H^1)$, e.g.
multivariate normal.

The generalization properties of (12) seem to be the best. It is a
reasonable conjecture that the limiting decision surfaces of (12) are
optimal solutions to the following:

Minimize F(w)

Subj. to

  $-1 \le w_i \le 1$, i = 0, 1, 2, ..., m

where

$$F(w) = P(H^1) \int_{X^0} (-\lambda - xu)\, P(x \mid H^1)\, dx + P(H^0) \int_{X^1} (\lambda + xu)\, P(x \mid H^0)\, dx$$

is the expected error distance. It is also reasonable to assume that the
limiting decision surfaces of the integer program mentioned in section
two will minimize the probability of error among all linear decision
functions. A brief attempt was made to prove these conjectures, but the proof
is elusive.
V. EXAMPLE

Models (10) and (12) were employed to design decision functions
using data from a NASA biomedical experiment. Two types of
electroencephalograms (brainwaves, EEG) were recorded. In one instance the
subject was watching a strobe light. In the other case the light was not
visible. The object is to distinguish the two cases using the EEG data.

Of a possible one hundred features K. Prahbu selected five, using
a distance-dispersion technique, and prepared the data for the linear
programming models. The parameters were $n_1 = 165$, $n_0 = 155$, m = 5, n = 320,
and the problems were solved on an IBM 360-65 using the mathematical
programming package MPS 360 [11]. Results are tabulated below.


Model (10)    Solution Time 0.09 min.

  Errors      Number    Percentage
  Type I        25         16.6
  Type II       21         12.7
  Total         46         14.4

Model (12)    Solution Time 0.92 min.

  Errors      Number    Percentage
  Type I        18         11.6
  Type II       24         14.5
  Total         42         13.1
Notice the performance of model (12) is slightly better although
the solution time is longer. Both problems had unique optimal
solutions and 31 of the points were incorrectly classified by both techniques.
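The tabulated percentages can be recomputed from the error counts. The sketch assumes that Type I errors are counted over the 155 samples of one class and Type II errors over the 165 samples of the other; this assignment of the class sizes is an inference from the printed figures and is not stated explicitly in the report.

```python
N_TYPE_I, N_TYPE_II = 155, 165   # assumed denominators for the two error types


def pct(errors, total):
    """Error rate in percent."""
    return 100.0 * errors / total


counts = {"Model (10)": (25, 21), "Model (12)": (18, 24)}   # (Type I, Type II) errors

for name, (type1, type2) in counts.items():
    total = pct(type1 + type2, N_TYPE_I + N_TYPE_II)
    print(name, round(pct(type1, N_TYPE_I), 1), round(pct(type2, N_TYPE_II), 1),
          round(total, 1))
```

Under this assumption the recomputed values agree with the table to one decimal place, except for the Type I rate of model (10), where 25/155 is about 16.1 percent against the printed 16.6.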


APPENDIX 


Linear  and  Quadratic  Programming 

Several  results  from  mathematical  programming  have  been  used 
in  this  report.  This  appendix  attempts  to  motivate  and  explain  these 
results  while  citing  more  substantial  references. 

A linear program is an optimization problem

Min $\sum_{j=1}^{m} x_j c_j$

Subj. to

  $\sum_{j=1}^{m} x_j a_{ji} = b_i$,  i = 1, 2, ..., n

  $x_j \ge 0$,  j = 1, 2, ..., m

Our vector notation is

Min $xc$

Subj. to

  $xA = b$
  $x \ge 0$

c is m x 1, A m x n, and b 1 x n. We shall call this problem the primal.
There is an associated dual problem:

Max $by$

Subj. to

  $Ay \le c$

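The easy half of the relation between the two problems follows directly from the constraints. As a short worked step, for any primal feasible x and dual feasible y:

```latex
xc \;\ge\; x(Ay) \;=\; (xA)\,y \;=\; by ,
\qquad\text{since } x \ge 0 \text{ and } Ay \le c .
```

Every dual feasible y therefore gives a lower bound on the primal objective, and a feasible pair with $xc = by$ must be optimal for both problems.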
Linear programs appear in many forms: maximization or minimization,
equalities or inequalities, nonnegative or unrestricted variables. Any
problem can be transformed into the same form as our primal, which
allows us to know its dual. The dual can be found directly using the
diagrams on pp. 126-7 of [2].

An efficient algorithm, known as the simplex method, has been
devised to solve linear programs. In a finite number of steps it finds
a feasible solution (if one exists), then again in a finite number of steps
it determines an optimal or an unbounded solution. An optimal dual
solution is supplied as a by-product of the calculations.

The principal theoretical result in linear programming relates
primal and dual.

Theorem: [2] pg. 129

If both primal and dual have feasible solutions, they have optimal
solutions $(\bar x, \bar y)$ such that

$$\bar x c = b \bar y$$

We shall discuss quadratic programming in the context of problem (6).

Min $\frac{uIu'}{2}$

Subj. to

  $yD - uI = 0$
  $yg = 1$
  $y \ge 0$

A central result in the study of these problems is the Kuhn-Tucker theorem,
[8]. In our case it states:

Theorem:

$(\bar y, \bar u)$ is optimal for (6) if and only if there exist $(x, z, \lambda)$ such that

  $g\lambda + Dx + z = 0$
  $\bar u' - x = 0$
  $\bar y \ge 0$
  $z \le 0$
  $\bar y g = 1$
  $\bar y D - \bar u I = 0$
  $\bar y z = 0$

Suppose $\bar u \ne 0$ is optimal in (6), then $\bar u I \bar u' > 0$. Juggling the above
equations we can easily establish that

$$\lambda = -\bar u I \bar u' < 0$$

and

$$D\bar u' \ge g(\bar u I \bar u') > 0.$$
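The juggling can be written out in two steps. The derivation below assumes the sign condition on the multiplier z reads $z \le 0$, which is the reading under which the final inequality holds:

```latex
\begin{aligned}
0 &= \bar y\,(g\lambda + Dx + z)
   = (\bar y g)\,\lambda + (\bar y D)\,x + \bar y z
   = \lambda + \bar u\,\bar u'
   &&\Longrightarrow\quad \lambda = -\bar u I \bar u' , \\
D\bar u' &= Dx = -g\lambda - z
   = g\,(\bar u I \bar u') - z \;\ge\; g\,(\bar u I \bar u')
   &&\text{since } z \le 0 .
\end{aligned}
```

The first line uses $\bar y g = 1$, $\bar y D = \bar u$, $\bar y z = 0$ and $x = \bar u'$. The second line says every difference vector $D_k$ has a projection on $\bar u$ of at least $\bar u I \bar u' > 0$, which is what makes $\bar u$ a separating direction.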